sceptre part 7In this writeup I examine the following:
How does sample size vary across datasets?
How does the “exact” version of SCEPTRE (i.e., the version of SCEPTRE that returns an “exact” p-value based on B = 300,000 resamples) perform on the Frangieh IFN-gamma, Papalexi gene, and Papalexi protein data?
A question that I seek to answer is whether SCEPTRE exhibits CAMP, or “Confounder Adjustment via Marginal Permutations” (previously known as implicit confounder adjustment).
Sample size has emerged as an important quantity in low MOI (and even in high MOI). Here, I study issues related to sample size in a more precise and in-depth way than I have previously.
I begin by defining several quantities –
n_nonzero_treatment, n_nonzero_control, and
effective_sample_size – on the negative control data. Let a
dataset be given. Let \(d\) be the
number of NTCs, let \(p\) be the number
of genes, and let \(n\) be the number
of cells in the dataset. Let \(X \in
\mathbb{R}^{n \times p}\) be the cell-by-gene expression matrix.
Next, for \(k \in \{1, …, d\},\) let
\(n_k\) be the number of cells
containing NTC \(k\) (where the NTCs
are arbitrarily indexed). Also, let \(X^{1}
\in \mathbb{R}^{n_1 \times p}\) be the submatrix of \(X\) consisting of the cells that received
NT 1 (and similarly for \(X^2, \dots,
X^d\)). For a given NTC \(k\)
and gene \(j\), let
$$\texttt{n_nonzero_treatment}_{kj}$$ be the number of cells containing
NTC \(k\) in which gene \(j\) has nonzero expression, i.e.
\[ \texttt{n_nonzero_treatment}_{jk} = \sum_{i=1}^{n_k} \mathbb{I}(X^k_{i,j} \geq 1).\]
Next, let \(\texttt{n_nonzero_control}_{kj}\) be the number of NT-containg cells excluding NTC \(k\) (i.e., the cells containing NTCs in the set \([d] \setminus \{k\}\)) with nonzero expression, i.e.,
\[ \texttt{n_nonzero_control}_{jk} = \sum_{k' \neq k} \texttt{n_nonzero_treatment}_{k'j} \]
Finally, let effective_sample_size be the sum of
n_nonzero_treatment and n_nonzero_control. On
the negative control data, effective_sample_size is a
function of the gene only:
\[ \texttt{effective_sample_size}_j = \sum_{k = 1}^d \texttt{n_nonzero_treatment}_{jk} \]
I study how effective_sample_size,
n_nonzero_treatment, and n_nonzero_control
relate to one another.
effective_sample_size,
n_nonzero_treatment, and
n_nonzero_controlFirst, I plot n_nonzero_treatment against
effective_sample_size, faceting by dataset. In other words,
I plot
\[ \{ (\texttt{n_nonzero_treatment}_{jk}, \texttt{effective_sample_size}_{j})\}_{j,k} \]
for each dataset.
Clearly, n_nonzero_treatment and
effective_sample_size are correlated. This makes sense: for
a given gene, if expression is high in cells with a given NTC, then
expression likely is high in cells with the remaining NTCs. (Put
differently, expression levels are correlated across NTCs: highly
expressed genes tend to be highly expressed across NTCs, and similarly
for lowly expressed genes.)
Next, I plot n_nonzero_control against
effective_sample_size, i.e.
$$\{(\texttt{n_nonzero_control}_{jk}, \texttt{effective_sample_size}_{j})\}_{j,k}$$
Clearly, n_nonzero_control and
effective_sample_size are basically the same variable.
Again, this makes sense: for a given gRNA,
n_nonzero_control is equal to
effective_sample_size minus the number of cells with
nonzero expression containing that gRNA. Finally, I plot
n_nonzero_treatment against
n_nonzero_control.
Again, these variables are correlated (but certainly not perfectly)
for the same reason that n_nonzero_treatment and
effective_sample_size are correlated.
I conclude on the basis of this analysis that, while
effective_sample_size is a useful single-number summary of
sample size, it is meaningful to look at both at
n_nonzero_treatment and n_nonzero_control when
examining sample size-related issues.
Here, I make some comparisons related to sample size across datasets. First, I make a barplot of the number of NTCs per dataset.
The Frangieh data contain the greatest number of NTCs, followed by
the Schraivogel data and then the Papalexi data. Next, I plot the
average number of nonzero cells per NTC (i.e.,
n_nonzero_treatment) across datasets.
I print the median n_nonzero_treatment value for each
dataset below.
sample_size_df_nt |>
group_by(dataset) |>
summarize(m = median(n_nonzero_treatment))
## # A tibble: 7 × 2
## dataset m
## <fct> <dbl>
## 1 frangieh_co_culture_gene 9
## 2 frangieh_control_gene 6
## 3 frangieh_ifn_gamma_gene 7
## 4 papalexi_eccite_screen_gene 29
## 5 schraivogel_enhancer_screen 52.5
## 6 schraivogel_ground_truth_perturbseq_gene 37
## 7 schraivogel_ground_truth_tapseq_gene 165
The Frangieh data have the smallest number nonzero cells per NTC
(median under 10). The Papalexi data are indermediate (median about 30),
and the Schraivogel data have the greatest number of cells per NTC
(median 37 on perturb-seq data; even greater median on other datasets).
Finally, I plot the effective_sample_size per dataset.
I print the per-dataset median effective sample size below.
sample_size_df_nt |>
group_by(dataset) |>
summarize(m = median(effective_samp_size))
## # A tibble: 7 × 2
## dataset m
## <fct> <dbl>
## 1 frangieh_co_culture_gene 712
## 2 frangieh_control_gene 436
## 3 frangieh_ifn_gamma_gene 581
## 4 papalexi_eccite_screen_gene 315
## 5 schraivogel_enhancer_screen 1764
## 6 schraivogel_ground_truth_perturbseq_gene 1186
## 7 schraivogel_ground_truth_tapseq_gene 5775
We see that the effective sample sizes are in a similar range on the Frangieh and Papalexi datasets (while the effective sample sizes on the Schraivogel datasets are larger). The Frangieh data have fewer nonzero cells per NTC but many more NTCs than the Papalexi data, causing the effective sample size to be roughly equal across the Frangieh and Papalexi data.
Now, I examine the number of nonzero cells per targeting gRNA (grouped and ungrouped) across datasets.
sample_size_df_t |>
group_by(dataset) |>
summarize(median_nonzero = median(n_nonzero_cells))
## # A tibble: 7 × 2
## dataset median_nonzero
## <fct> <dbl>
## 1 frangieh_co_culture_gene 5
## 2 frangieh_control_gene 3
## 3 frangieh_ifn_gamma_gene 4
## 4 papalexi_eccite_screen_gene 17
## 5 schraivogel_enhancer_screen 14
## 6 schraivogel_ground_truth_perturbseq_gene 26
## 7 schraivogel_ground_truth_tapseq_gene 117
Next, I plot the number of nonzero cells per gRNA group:
grouped_ss_df <- sample_size_df_t |>
group_by(dataset, grna_group, response_id) |>
summarize(s = sum(n_nonzero_cells))
## `summarise()` has grouped output by 'dataset', 'grna_group'. You can override
## using the `.groups` argument.
grouped_ss_df$dataset_rename <- stringr::str_to_title(gsub(pattern = "_",
replacement = " ",
x = grouped_ss_df$dataset))
paper <- factor(x = grouped_ss_df$dataset,
levels = c("frangieh_co_culture_gene",
"frangieh_control_gene",
"frangieh_ifn_gamma_gene",
"papalexi_eccite_screen_gene",
"schraivogel_ground_truth_perturbseq_gene",
"schraivogel_enhancer_screen",
"schraivogel_ground_truth_tapseq_gene"),
labels = c(rep("Frangieh", 3), "Papalexi", rep("Schraivogel", 3)))
grouped_ss_df$paper <- paper
grouped_ss_df |>
group_by(dataset) |>
sample_n(10000, replace = TRUE) |>
ggplot(mapping = aes(x = dataset_rename,
y = s + 1,
fill = paper)) +
geom_violin() +
geom_boxplot() +
scale_y_log10() +
ylab("N nonzero cells per targeting gRNA (ungrouped)") +
xlab("") +
theme_bw() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
grouped_ss_df
## # A tibble: 11,769,880 × 6
## # Groups: dataset, grna_group [2,607]
## dataset grna_group response_id s dataset_rename paper
## <fct> <fct> <fct> <int> <chr> <fct>
## 1 frangieh_co_culture_gene A2M A1BG 43 Frangieh Co Cult… Fran…
## 2 frangieh_co_culture_gene A2M A1BG-AS1 15 Frangieh Co Cult… Fran…
## 3 frangieh_co_culture_gene A2M A4GALT 2 Frangieh Co Cult… Fran…
## 4 frangieh_co_culture_gene A2M AAAS 67 Frangieh Co Cult… Fran…
## 5 frangieh_co_culture_gene A2M AACS 20 Frangieh Co Cult… Fran…
## 6 frangieh_co_culture_gene A2M AADAT 13 Frangieh Co Cult… Fran…
## 7 frangieh_co_culture_gene A2M AAED1 36 Frangieh Co Cult… Fran…
## 8 frangieh_co_culture_gene A2M AAGAB 84 Frangieh Co Cult… Fran…
## 9 frangieh_co_culture_gene A2M AAK1 84 Frangieh Co Cult… Fran…
## 10 frangieh_co_culture_gene A2M AAMDC 56 Frangieh Co Cult… Fran…
## # … with 11,769,870 more rows
First, I plot the number of genes per dataset (after applying mild QC of filtering out genes expressed in fewer than 0.005 of cells).
All datasets have roughly the same number of genes (about 15,000) except for the TAP-seq and enhancer screen datasets (which of course have fewer genes). Next, I plot the number of pairs across datasets.
The Frangieh data of course contain more pairs because they contain more NT gRNAs.
I applied the exact version of SCEPTRE (with covariates and using B = 300,000 permutations) to the Papalexi data and Frangieh data.
First, I plot the SCEPTRE exact results on the Frangieh data, comparing to the skew-normal version of SCEPTRE (with covariates) and NB regression.
SCEPTRE (exact) exhibits good calibration, enabling us to conclude that a permutation test works on the Frangieh data. The other methods (NB regression and SCEPTRE (SN)) are not as well calibrated, likely because asymptotics have not yet kicked in. Next, I plot the results on the Papalexi gene expression data, first on an untransformed scale.
Clearly, in the bulk of the distribution, SCEPTRE (exact) is doing the best, followed by NB regression and then by SCEPTRE (SN). I now plot the same results on a transformed scale.
We see that on the transformed scale, SCEPTRE (exact) performs the worst (beyond about 0.001), while NB regression is the best! What is going on here? In particular, why is SCEPTRE (exact) performing worse than SCEPTRE (SN) in the tail? Below, I plot the SCEPTRE (exact) p-values against the SCEPTRE (SN) p-values on a transformed scale.
The SCEPTRE (SN) p-values are in broad agreement with the SCEPTRE (exact) p-values; however, the scatterplot has an obvious “appendage” below the diagonal. The pairs in this “appendage” have much smaller SCEPTRE (exact) p-values than SCEPTRE (SN) p-values. Why does this appendage appear? And are the pairs in this appendage “qualitatively different” (in some way) than the other pairs? I plot the points in the appendage below and pick a few to investigate.